Main metrics that we are going to cover in this project, with definitions
Launch Angle: The vertical angle at which the ball leaves a player’s bat after being struck Exit Velocity: The speed of the baseball as it leaves the player’s bat after being struck (in mph) Batting Average: Hits/At bats. The percentage of times a hitter is reaching base via a hit Weighted on base average (wOBA): Measures a player’s overall offensive contributions per plate appearance by taking the type of outcome into account Hard-hit ball: Defined by Statcast as coming off the bat at >= 95 MPH Spin Rate: Total numbers of revolutions a baseball has on its way to home plate after a pitcher releases it Expected Batting Average: A Statcast metric that measures the likelihood that a batted ball will result in a hit
Scraping each week of the 2019 season
## Retrieving Weekly Play-by-play Data
week1_2019 <- scrape_statcast_savant(start_date = "2019-02-28", end_date = "2019-04-07",
player_type = "batter")
week2_2019 <- scrape_statcast_savant(start_date = "2019-04-08", end_date = "2019-04-14",
player_type = "batter")
week3_2019 <- scrape_statcast_savant(start_date = "2019-04-15", end_date = "2019-04-21",
player_type = "batter")
week4_2019 <- scrape_statcast_savant(start_date = "2019-04-22", end_date = "2019-04-28",
player_type = "batter")
week5_2019 <- scrape_statcast_savant(start_date = "2019-04-29", end_date = "2019-05-05",
player_type = "batter")
week6_2019 <- scrape_statcast_savant(start_date = "2019-05-06", end_date = "2019-05-12",
player_type = "batter")
week7_2019 <- scrape_statcast_savant(start_date = "2019-05-13", end_date = "2019-05-19",
player_type = "batter")
week8_2019 <- scrape_statcast_savant(start_date = "2019-05-20", end_date = "2019-05-26",
player_type = "batter")
week9_2019 <- scrape_statcast_savant(start_date = "2019-05-27", end_date = "2019-06-02",
player_type = "batter")
week10_2019 <- scrape_statcast_savant(start_date = "2019-06-03", end_date = "2019-06-09",
player_type = "batter")
week11_2019 <- scrape_statcast_savant(start_date = "2019-06-10", end_date = "2019-06-16",
player_type = "batter")
week12_2019 <- scrape_statcast_savant(start_date = "2019-06-17", end_date = "2019-06-23",
player_type = "batter")
week13_2019 <- scrape_statcast_savant(start_date = "2019-06-24", end_date = "2019-06-30",
player_type = "batter")
week14_2019 <- scrape_statcast_savant(start_date = "2019-07-01", end_date = "2019-07-07",
player_type = "batter")
week15_2019 <- scrape_statcast_savant(start_date = "2019-07-08", end_date = "2019-07-14",
player_type = "batter")
week16_2019 <- scrape_statcast_savant(start_date = "2019-07-15", end_date = "2019-07-21",
player_type = "batter")
week17_2019 <- scrape_statcast_savant(start_date = "2019-07-22", end_date = "2019-07-28",
player_type = "batter")
week18_2019 <- scrape_statcast_savant(start_date = "2019-07-29", end_date = "2019-08-04",
player_type = "batter")
week19_2019 <- scrape_statcast_savant(start_date = "2019-08-05", end_date = "2019-08-11",
player_type = "batter")
week20_2019 <- scrape_statcast_savant(start_date = "2019-08-12", end_date = "2019-08-18",
player_type = "batter")
week21_2019 <- scrape_statcast_savant(start_date = "2019-08-19", end_date = "2019-08-25",
player_type = "batter")
week22_2019 <- scrape_statcast_savant(start_date = "2019-08-26", end_date = "2019-09-02",
player_type = "batter")
week23_2019 <- scrape_statcast_savant(start_date = "2019-09-03", end_date = "2019-09-09",
player_type = "batter")
week24_2019 <- scrape_statcast_savant(start_date = "2019-09-10", end_date = "2019-09-16",
player_type = "batter")
week25_2019 <- scrape_statcast_savant(start_date = "2019-09-17", end_date = "2019-09-23",
player_type = "batter")
week26_2019 <- scrape_statcast_savant(start_date = "2019-09-24", end_date = "2019-09-30",
player_type = "batter")
### Binding rows into single dataframe
Statcast_2019 <- bind_rows(week1_2019, week2_2019, week3_2019, week4_2019, week5_2019, week6_2019,
week7_2019, week8_2019, week9_2019, week10_2019, week11_2019, week12_2019,
week13_2019, week14_2019, week15_2019, week16_2019, week17_2019, week18_2019,
week19_2019, week20_2019, week21_2019, week22_2019, week23_2019, week24_2019,
week25_2019, week26_2019)
### Sample to see if it works
Statcast_sample <- Statcast_2019[sample(nrow(Statcast_2019),10),]
Statcast_sample
## # A tibble: 10 × 92
## pitch_type game_date release_speed release_pos_x release_pos_z player_name
## <chr> <date> <dbl> <dbl> <dbl> <chr>
## 1 SI 2019-05-15 91.5 -1.62 5.8 Maldonado, M…
## 2 FF 2019-09-02 95.6 -1.8 6.87 Dyson, Jarrod
## 3 SL 2019-09-21 82 -2.28 5.57 Ward, Taylor
## 4 SI 2019-09-21 89.7 2.68 6.07 Carpenter, M…
## 5 SL 2019-08-13 76.8 2.82 5.32 Crawford, J.…
## 6 SI 2019-06-15 94.2 -2.43 5.74 Walker, Chri…
## 7 SL 2019-05-20 81.6 3.3 5.84 Acuña Jr., R…
## 8 CH 2019-06-14 87.6 -1.42 5.92 Kipnis, Jason
## 9 FF 2019-06-08 92.4 -2.49 5.74 Smoak, Justin
## 10 SL 2019-08-18 83.3 2.96 5.17 Owings, Chris
## # … with 86 more variables: batter <dbl>, pitcher <dbl>, events <chr>,
## # description <chr>, spin_dir <lgl>, spin_rate_deprecated <lgl>,
## # break_angle_deprecated <lgl>, break_length_deprecated <lgl>, zone <dbl>,
## # des <chr>, game_type <chr>, stand <chr>, p_throws <chr>, home_team <chr>,
## # away_team <chr>, type <chr>, hit_location <dbl>, bb_type <chr>,
## # balls <dbl>, strikes <dbl>, game_year <dbl>, pfx_x <dbl>, pfx_z <dbl>,
## # plate_x <dbl>, plate_z <dbl>, on_3b <dbl>, on_2b <dbl>, on_1b <dbl>, …
write.csv(Statcast_2019, "StatcastHittingData19.csv", row.names = FALSE)
### Storing in SQL database
db <- dbConnect(SQLite(), db_name = "Statcast.sqlite")
dbWriteTable(conn = db, name = "Statcast_Hitting", Statcast_2019, overwrite = TRUE, row.names = FALSE )
### Actual stored object
Statcast_2019 <- dbGetQuery(conn = db, "SELECT * FROM Statcast_Hitting")
EDA - Launch angle vs Exit Velocity
## Batted Ball Type Characteristics Breakdown
kable(Statcast_2019 %>%
mutate("Batted Ball Type" = fct_recode(bb_type, "Line Drive" = "line_drive", "Fly Ball" = "fly_ball", Popup = "popup", "Ground Ball" = "ground_ball")) %>%
group_by(`Batted Ball Type`) %>%
summarize("Average Launch Angle (°)" = round(mean(launch_angle, na.rm=TRUE),1), "Average Launch Speed (MPH)" = round(mean(launch_speed, na.rm=TRUE), 1), "Total" = n()) %>%
filter(!is.na(`Batted Ball Type`)) %>%
arrange(desc(`Average Launch Angle (°)`)))
| Popup |
64.1 |
77.3 |
9119 |
| Fly Ball |
36.5 |
92.0 |
29731 |
| Line Drive |
16.6 |
93.3 |
31488 |
| Ground Ball |
-11.7 |
85.6 |
54958 |
## Hit Type Characteristics Breakdown
kable(Statcast_2019 %>%
mutate("Hit Type" = fct_recode(events, Single = "single", Double = "double", Triple = "triple", "Home Run" = "home_run")) %>%
filter(events %in% c("single", "double", "triple", "home_run")) %>%
group_by(`Hit Type`) %>%
summarize("Average Launch Angle (°)" = round(mean(launch_angle, na.rm=TRUE), 1), "Average Launch Speed (MPH)" = round(mean(launch_speed, na.rm=TRUE), 1), "Total" = n()) %>%
arrange(desc(`Average Launch Angle (°)`)))
| Home Run |
28.2 |
103.5 |
6748 |
| Triple |
20.1 |
97.6 |
783 |
| Double |
16.5 |
97.3 |
8509 |
| Single |
6.6 |
90.3 |
25874 |
## Batted Ball Type Breakdown of Hits
kable(Statcast_2019 %>%
mutate(ab = ifelse(type == "X", TRUE, FALSE), hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE,FALSE), exb_hit = ifelse(events %in% c("double", "triple", "home_run"), TRUE,FALSE), "Batted Ball Type" = fct_recode(bb_type, "Line Drive" = "line_drive", "Fly Ball" = "fly_ball", Popup = "popup", "Ground Ball" = "ground_ball")) %>%
filter(!is.na(hit)) %>%
select(`Batted Ball Type`, hit, ab, exb_hit) %>%
group_by(`Batted Ball Type`) %>%
summarize("Total Hits" = sum(hit), "Total Extra Base Hits" = sum(exb_hit), "Total Batted Balls" = sum(ab)) %>%
filter(!is.na(`Batted Ball Type`)) %>%
arrange(desc(`Total Hits`)) %>%
mutate("Hits Per Batted Balls (%)" = round((`Total Hits`/`Total Batted Balls`)*100, 1), "Extra Base Hits Per Batted Balls (%)" = round((`Total Extra Base Hits`/`Total Batted Balls`)*100, 1), "Extra Base Hits Per Total Hits (%)" = round((`Total Extra Base Hits`/`Total Hits`)*100, 1)))
| Line Drive |
19699 |
7201 |
31487 |
62.6 |
22.9 |
36.6 |
| Ground Ball |
13119 |
1233 |
54956 |
23.9 |
2.2 |
9.4 |
| Fly Ball |
8904 |
7553 |
29731 |
29.9 |
25.4 |
84.8 |
| Popup |
192 |
53 |
9119 |
2.1 |
0.6 |
27.6 |
Overall Line Drives had the highest overall exit velocity (Launch speed) and were the second highest proportion of batted ball type in 2019
EDA - Launch angle vs exit velocity visualization
### Expected Batting average using launch angle vs exit velocity
parameters <- tibble(launch_angle = c(10, 25, 50),
launch_speed = 40,
label = c("Ground balls", "Line drives", "Fly balls"))
batting_average <- ggplot(Statcast_2019, aes(x = launch_speed, y = launch_angle, color = estimated_ba_using_speedangle)) +
geom_point(alpha = .05) +
geom_hline(data = parameters, aes(yintercept = launch_angle), color = "black", linetype = "dotted") +
geom_text(data = parameters, aes(label = label, y = launch_angle - 4), color = "black", hjust = "left") +
scale_color_gradient("xBA", low = "#6ac3cd", high = "#fc766a", label = percent)+
theme_bw() +
labs(title = "Launch angle vs. exit velocity in 2019",
subtitle = "Line drives and fly balls with high exit velocity lead to a higher expected batting average",
caption = "Data: baseballr | Bill Petti",
x = "Exit Velocity (mph)",
y = "Launch Angle (degrees)") +
theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
plot.subtitle = element_text(size = 9, hjust = 0.5),
legend.title = element_text(hjust = 0.5),
plot.caption.position = "plot",
panel.grid = element_blank()) +
scale_y_continuous(breaks = seq(-75, 75, 25))
batting_average

This graph aims to show the relationship between launch angle and exit velocity and the chance that a batter will achieve a hit. The color is mapped by estimated batting average, which is a Statcast tool used to predict a batters chance of getting a hit based on their exit velocity and launch angle. The main takeaway that we see from this graph is that hitting the ball hard at an optimal launch angle gives the batter the best chance at reaching base via a hit. This can be seen in the saturated pockets of the graph with the orange color.
### Expected wOBA using launch angle vs exit velocity
woba <- ggplot(Statcast_2019, aes(x = launch_speed, y = launch_angle, color = estimated_woba_using_speedangle)) +
geom_point(alpha = .05) +
geom_hline(data = parameters, aes(yintercept = launch_angle), color = "black", linetype = "dotted") +
geom_text(data = parameters, aes(label = label, y = launch_angle - 4), color = "black", hjust = "left") +
scale_color_gradient("xwOBA", low = "#6ac3cd", high = "#fc766a", label=percent) +
theme_bw() +
labs(title = "Launch angle vs. exit velocity in 2019",
subtitle = "Line drives and fly balls with high exit velocity lead to a higher expected wOBA",
caption = "Data: baseballr | Bill Petti",
x = "Exit Velocity (mph)",
y = "Launch Angle (degrees)") +
theme(plot.title = element_text(face = "bold", size = 11, hjust = 0.5),
plot.subtitle = element_text(size = 9, hjust = 0.5),
legend.title = element_text(hjust = 0.5),
plot.caption.position = "plot",
panel.grid = element_blank()) +
scale_y_continuous(breaks = seq(-75, 75, 25))
woba

This is a very similar graph to the one above, but instead of using batting avgerage as the predictor, we are using weighted on base average (wOBA). wOBA is similar to batting average, but instead of assuming that all hits are created equally, wOBA attempts to capture all offensive value by giving different weights to different types of offensive outcomes (i.e. single, double, triple, home run). This is why in this graph there is an even more saturated pocket that can be seen. It makes it more apparent that this pocket will represent the more valuable types of hits such as doubles, triples and home runs. This balls in this pocket are generally hit harder and at a more optimal launch angle than some of the points on the previous graph.
Finding Ideal Launch Angle
## Launch Angle vs. Hit Distance vs. Hit Type for Each Batted Ball Type
Statcast_2019 %>%
mutate(hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE,FALSE)) %>%
filter(hit == TRUE ) %>%
ggplot(aes(launch_angle,hit_distance_sc, color = fct_relevel(events, levels = c("single", "double", "triple", "home_run")))) +
geom_point(na.rm=TRUE, alpha=0.5) +
labs(x="Launch Angle", y="Hit distance", title="Ready to Launch", subtitle = "Hit Distance and Type of Hit Balls in regard to Launch Angle") +
scale_color_discrete_qualitative(palette="Dark 3", name="Hit Type") +
theme_gray() +
facet_grid(fct_relevel(bb_type, levels = c("ground_ball", "line_drive", "fly_ball", "popup"))~., shrink=TRUE) +
geom_vline(xintercept =c(15,35), color = "black", linetype = "dashed")

## 3D Scatter-plot Incorporating Exit Velocity
Statcast_2019 %>%
mutate(hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE, FALSE)) %>%
filter(type == "X", hit == TRUE) %>%
plot_ly(x = ~launch_angle, y = ~launch_speed, z = ~hit_distance_sc, color = ~events, colors = c("#6ac3cd","#fc766a", "green", "purple"), position="jitter") %>%
add_markers() %>%
add_lines(x=15, y=NULL, z=NULL, color=NULL) %>%
add_lines(x=35, y=NULL, z=NULL, color=NULL) %>%
layout(scene = list(xaxis = list(title = 'Launch Angle'),
yaxis = list(title = 'Launch Speed'),
zaxis = list(title = 'Hit Distance',
trace4 = "15°")))
In the first graphic, by breaking down each batted ball into their respective category it allows a better visual interpretation of the proportion of hit types and distance each produce. Further, this displays how the distribution of the launch of different types of batted balls overlap. This being because the flight path of the ball coming off of the bat is impacted by the spin direction of the pitch, thus producing different outcomes. Including all batted ball outcomes the launch angle range with the highest average distance and proportion of hits, as well as extra-base hits, is between 15° and 35°. This range allowing for greatest potential in distance as well as hit type since batted ball outcome is not directly attributable to launch angle, and thus can not be controlled for. Looking now at the 3D Scatter-plot it displays a downward converging trend in both exit velocity and hit distance as launch angle increases past 35°. Here in the graphic trace 4 represents the lower limit of the ideal launch range (15°) and trace 5 the upper (35°). Further, for launch angles less than 15° there is a significantly greater portion or low distance and velocity “single” hit type outcomes, displaying that in this range there is a much higher likelihood of this resulting outcome than within the 15° to 35° range.
## Launch Angle vs. Exit Velocity vs. Hit Distance
paras <- data.frame(label = c("15°","35°", "95 (MPH)"), point = c(20,40,85), y=c(5,5, 100))
Statcast_2019 %>%
mutate(ila = ifelse(launch_angle^2 >=15^2 & launch_angle^2 <= 35^2, "Ideal", ifelse(launch_angle^2 < 15^2,"Below", "Above"))) %>%
ggplot(aes(launch_angle^2,launch_speed^2, color = hit_distance_sc)) +
geom_point(na.rm=TRUE, alpha=0.5) +
geom_vline(xintercept=c(15^2,35^2), color = "blue", linetype = "dashed") +
geom_hline(yintercept = 95^2, color="blue", linetype="dashed") +
geom_smooth() +
scale_color_continuous_sequential("Sunset") +
geom_text(data=paras, aes(label = label, x=point^2, y=y^2), color="blue") +
theme_minimal() +
labs(x="Launch Angle", y="Exit Velocity (MPH)", title="Liftoff!", subtitle="Relationship between launch angle and speed to hit distance", color="Hit Distance") +
theme(axis.text = element_blank(),
axis.ticks = element_blank())

By squaring the x and y axis variable it allow a greater display relationship that each unit increase in these variable interact with each other, and with hit distance. Along with the vertical limit lines displaying our theorized ideal launch angle range there is a horizontal limit displaying the threshold of batted balls deemed “Hard Hit” (>=95MPH launch speed). This displaying that overall this range produces the greatest combination of exit velocity and hit distance than those outside this range.
# Launch Angle Categories Breakdown of Batted Balls
kable(Statcast_2019 %>%
mutate(ila = ifelse(launch_angle >=15 & launch_angle <= 35, "Ideal (15-35 degrees)", ifelse(launch_angle < 15,"Below 15 degrees", "Above 35 degrees")), hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE,FALSE), exb_hit = ifelse(events %in% c("double", "triple", "home_run"), TRUE,FALSE), batted = ifelse(type =="X", TRUE, FALSE)) %>%
select(ila, launch_speed, hit_distance_sc, hit, exb_hit, batted) %>%
group_by(ila) %>%
summarize("Average Hit Distance" = mean(hit_distance_sc, na.rm=TRUE), "Average Launch Speed" = mean(launch_speed, na.rm=TRUE), "Total Hits" = sum(hit), "Total Extra Base Hits" = sum(exb_hit), "Total In Play" = sum(batted)) %>%
arrange(desc(`Average Hit Distance`)) %>%
mutate("Proportion of Hits to Batted Balls" = `Total Hits`/`Total In Play`, "Proportion of Extra Base Hits to Hits" = `Total Extra Base Hits` / `Total Hits`, "Launch Angle Category" = ila) %>%
select(`Launch Angle Category`, `Average Hit Distance`, `Average Launch Speed`, `Proportion of Hits to Batted Balls`, `Proportion of Extra Base Hits to Hits`) %>%
filter(!is.na(`Launch Angle Category`)))
| Ideal (15-35 degrees) |
276.36408 |
86.59104 |
0.5208381 |
0.6459439 |
| Above 35 degrees |
222.85339 |
79.22390 |
0.0635940 |
0.6271871 |
| Below 15 degrees |
60.56468 |
84.82648 |
0.3339746 |
0.1493617 |
## Launch Angle Categories Breakdown of Hard Hit Batted Balls (Exit Velocity >= 95 MPH)
kable(Statcast_2019 %>%
filter(launch_speed >= 95) %>%
mutate(ila = ifelse(launch_angle >=15 & launch_angle <= 35, "Ideal (15-35 degrees)", ifelse(launch_angle < 15,"Below 15 degrees", "Above 35 degrees")), hit = ifelse(events %in% c("single", "double", "triple", "home_run"), TRUE,FALSE), exb_hit = ifelse(events %in% c("double", "triple", "home_run"), TRUE,FALSE), batted = ifelse(type =="X", TRUE, FALSE)) %>%
select(ila, launch_speed, hit_distance_sc, hit, exb_hit, batted) %>%
group_by(ila) %>%
summarize("Average Hit Distance" = mean(hit_distance_sc, na.rm=TRUE), "Total Hits" = sum(hit), "Total Extra Base Hits" = sum(exb_hit), "Total In Play" = sum(batted)) %>%
arrange(desc(`Average Hit Distance`)) %>%
mutate("Proportion of Hits to Batted Balls" = `Total Hits`/`Total In Play`, "Proportion of Extra Base Hits to Hits" = `Total Extra Base Hits` / `Total Hits`, "Launch Angle Category" = ila) %>%
select(`Launch Angle Category`, `Total Hits`, `Total Extra Base Hits`, `Total In Play`, `Average Hit Distance`, `Proportion of Hits to Batted Balls`, `Proportion of Extra Base Hits to Hits`) %>%
filter(!is.na(`Launch Angle Category`)))
| Ideal (15-35 degrees) |
10937 |
9943 |
17005 |
359.87508 |
0.6431638 |
0.9091158 |
| Above 35 degrees |
662 |
654 |
4080 |
330.67847 |
0.1622549 |
0.9879154 |
| Below 15 degrees |
12792 |
2298 |
24559 |
96.91968 |
0.5208681 |
0.1796435 |
Batted balls that had launch angles within the 15° to 35° range an average distance 50 feet greater than the next highest category. Further, it possessed the greatest average launch speed as well as a significantly greater proportion of the batted balls resulting in a hit, and of those hits being extra base hits as well. Next isolating batted balls deemed to be hardly hit the ideal range had the highest total resulting in extra base hits, the highest average distance, as well as the highest proportion of balls resulting in hits. These tables displaying how the other ranges outside the ideal still result in a significant amount of hits, as well as hard hit balls but balls hit within the ideal range have the greatest potential for distance as well as becoming a hit.
Hit Distance Prediction Model
## Transforming and Re-leveling variables
s_cr <- Statcast_2019 %>%
mutate(z_r = ifelse(zone < 7, ifelse(zone < 4, "low", "middle"), "high"), la_sq = launch_angle^2, ila = ifelse(launch_angle >=15 & launch_angle <= 35, "Ideal (15-35 degrees)", ifelse(launch_angle < 15,"Below 15 degrees", "Above 35 degrees")), ila = fct_relevel(ila, levels=c("Ideal (15-35 degrees)", "Below 15 degrees", "Above 35 degrees")), bb_type = fct_relevel(bb_type, levels=c("line_drive", "ground_ball", "fly_ball", "popup")), ls_sqr = sqrt(launch_speed), warm = ifelse(effective_speed >= 90, "Gas", ifelse(effective_speed <= 75, "slow", "Norm")))
## Creating Training and Test sets
split_dummy <- sample(c(rep(0, 0.7 * nrow(s_cr)), rep(1, 0.3 * nrow(s_cr))))
data_train <- s_cr[split_dummy == 0, ]
data_test <- s_cr[split_dummy == 1, ]
The variables that were chosen to be placed within the model are: effective pitch speed, release spin rate, the interaction of effective pitch speed and release spin rate, launch speed, launch angle, the interaction of launch speed angle, batted ball type, and pitch location. In transforming the data to be applied to the model the launch angle was converted to a dummy categorical variable specifying the ideal or above/below ranges, the pitch location variable was divided into three categories of high/middle/low, and the batted ball type was re-leveled with line drive placed first.
## Training Model
my_lm <- lm(hit_distance_sc ~ release_spin_rate*effective_speed + ila + launch_speed + bb_type + launch_speed_angle + z_r, data = data_train)
summary(my_lm)
##
## Call:
## lm(formula = hit_distance_sc ~ release_spin_rate * effective_speed +
## ila + launch_speed + bb_type + launch_speed_angle + z_r,
## data = data_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -188.255 -22.168 -3.214 25.271 161.898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.677e+02 1.454e+01 11.531 < 2e-16 ***
## release_spin_rate -1.997e-02 6.273e-03 -3.183 0.001460 **
## effective_speed -6.384e-01 1.707e-01 -3.739 0.000185 ***
## ilaBelow 15 degrees -9.999e+01 5.255e-01 -190.285 < 2e-16 ***
## ilaAbove 35 degrees -2.800e+01 5.739e-01 -48.790 < 2e-16 ***
## launch_speed 8.675e-01 1.281e-02 67.703 < 2e-16 ***
## bb_typeground_ball -1.130e+02 5.236e-01 -215.726 < 2e-16 ***
## bb_typefly_ball 4.724e+01 4.987e-01 94.724 < 2e-16 ***
## bb_typepopup -6.317e+01 9.113e-01 -69.322 < 2e-16 ***
## launch_speed_angle 2.426e+01 1.808e-01 134.183 < 2e-16 ***
## z_rlow 2.510e+00 4.095e-01 6.130 8.85e-10 ***
## z_rmiddle 7.984e-01 3.056e-01 2.613 0.008984 **
## release_spin_rate:effective_speed 2.384e-04 7.389e-05 3.227 0.001253 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38 on 78456 degrees of freedom
## (432387 observations deleted due to missingness)
## Multiple R-squared: 0.9251, Adjusted R-squared: 0.9251
## F-statistic: 8.076e+04 on 12 and 78456 DF, p-value: < 2.2e-16
kable(data.frame(my_lm$coefficients) %>%
rownames_to_column(var = "Model Variables"))
| (Intercept) |
167.6833899 |
| release_spin_rate |
-0.0199659 |
| effective_speed |
-0.6384398 |
| ilaBelow 15 degrees |
-99.9894977 |
| ilaAbove 35 degrees |
-27.9986276 |
| launch_speed |
0.8675357 |
| bb_typeground_ball |
-112.9609979 |
| bb_typefly_ball |
47.2432600 |
| bb_typepopup |
-63.1743010 |
| launch_speed_angle |
24.2560284 |
| z_rlow |
2.5101442 |
| z_rmiddle |
0.7984426 |
| release_spin_rate:effective_speed |
0.0002384 |
## Test Model
my_lm2 <- lm(hit_distance_sc ~ effective_speed*release_spin_rate + ila + launch_speed + bb_type + launch_speed_angle, data = data_test)
summary(my_lm2)
##
## Call:
## lm(formula = hit_distance_sc ~ effective_speed * release_spin_rate +
## ila + launch_speed + bb_type + launch_speed_angle, data = data_test)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184.011 -22.464 -3.052 25.362 157.349
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.985e+02 2.262e+01 8.773 < 2e-16 ***
## effective_speed -1.033e+00 2.658e-01 -3.887 0.000102 ***
## release_spin_rate -3.585e-02 9.770e-03 -3.669 0.000243 ***
## ilaBelow 15 degrees -9.973e+01 8.157e-01 -122.264 < 2e-16 ***
## ilaAbove 35 degrees -2.841e+01 8.817e-01 -32.227 < 2e-16 ***
## launch_speed 8.988e-01 1.992e-02 45.132 < 2e-16 ***
## bb_typeground_ball -1.120e+02 8.106e-01 -138.151 < 2e-16 ***
## bb_typefly_ball 4.763e+01 7.668e-01 62.111 < 2e-16 ***
## bb_typepopup -6.277e+01 1.417e+00 -44.286 < 2e-16 ***
## launch_speed_angle 2.456e+01 2.782e-01 88.277 < 2e-16 ***
## effective_speed:release_spin_rate 4.210e-04 1.151e-04 3.658 0.000254 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.37 on 33548 degrees of freedom
## (185378 observations deleted due to missingness)
## Multiple R-squared: 0.9239, Adjusted R-squared: 0.9239
## F-statistic: 4.074e+04 on 10 and 33548 DF, p-value: < 2.2e-16
kable(data.frame(my_lm2$coefficients) %>%
rownames_to_column(var = "Model Variables"))
| (Intercept) |
198.4665681 |
| effective_speed |
-1.0330593 |
| release_spin_rate |
-0.0358503 |
| ilaBelow 15 degrees |
-99.7330183 |
| ilaAbove 35 degrees |
-28.4133500 |
| launch_speed |
0.8988250 |
| bb_typeground_ball |
-111.9883970 |
| bb_typefly_ball |
47.6276931 |
| bb_typepopup |
-62.7710412 |
| launch_speed_angle |
24.5622019 |
| effective_speed:release_spin_rate |
0.0004210 |
# Ranking Predictors
kable(data.frame(anova(my_lm2) %>% mutate(percent_var = `Sum Sq`/sum(`Sum Sq`), mean_percent_var = `Mean Sq`/sum(`Sum Sq`))))
| effective_speed |
1 |
300429.30 |
3.004293e+05 |
204.07568 |
0.0000000 |
0.0004628 |
0.0004628 |
| release_spin_rate |
1 |
1092030.27 |
1.092030e+06 |
741.79456 |
0.0000000 |
0.0016823 |
0.0016823 |
| ila |
2 |
456135609.84 |
2.280678e+08 |
154921.94816 |
0.0000000 |
0.7027087 |
0.3513544 |
| launch_speed |
1 |
59703300.97 |
5.970330e+07 |
40555.27128 |
0.0000000 |
0.0919771 |
0.0919771 |
| bb_type |
3 |
70992657.85 |
2.366422e+07 |
16074.63603 |
0.0000000 |
0.1093691 |
0.0364564 |
| launch_speed_angle |
1 |
11479193.60 |
1.147919e+07 |
7797.58913 |
0.0000000 |
0.0176845 |
0.0176845 |
| effective_speed:release_spin_rate |
1 |
19702.68 |
1.970268e+04 |
13.38364 |
0.0002542 |
0.0000304 |
0.0000304 |
| Residuals |
33548 |
49387571.03 |
1.472147e+03 |
NA |
NA |
0.0760850 |
0.0000023 |
## Plotting the Model's Residuals
autoplot(my_lm2, which=1:3)

Looking at the power of the model it is significantly high with a total of ~92.5% of the variation in hit distance being explained. Further, the standard error of the model being ~38 feet is permissible since there are intangible variables that effect distance that are or can not be included in the model. It appears that in the model the predictors which positively effect distance are: launch speed, interaction of launch speed and angle, interaction of effective pitch speed and release spin rate, pitch location middle/low, batted ball type fly ball. Further, those with negative relationships, or reduce hit distance, were effective pitch speed, release spin rate, launch angle range above 35°/below 15°, and batted ball type pop up/ground ball. This effectively displaying hitting hard (above 95MPH exit velocity) line drives or fly balls that have a launch angle of 15° to 35°, on characteristically fastballs, is the best theoretical approach to hitting according to the model. These batted balls having the highest potential for overall hit distance, which is correlated with a greater proportion hits as well as extra base hits. It does appear that for balls that are within 200 feet the model innacurately predicts them but these balls are exposed to a much higher degree of intangible effect and thus are outliers in regard to the majority of batted balls, thus not displaying true heteroscedasity in the residuals.
## Expected Batting Average by Pitch Location
ehit_plot(Statcast_2019, title="Pitch Location and Expected Batting Average") +
theme_classic() +
scale_fill_viridis(option="inferno", name="Expected Batting Average", label=percent)

## Average Launch Angle by Pitch Location
la_plot(Statcast_2019, title ="Pitch Location and Launch Angle") +
theme_classic() +
scale_fill_viridis(option="inferno", name="Projected Launch Angle")

## Home Run Probability by Pitch Location
home_run_plot(Statcast_2019, title ="Pitch Location and Homeruns") +
theme_classic() +
scale_fill_viridis(option="inferno", name="Probability of Homeruns", label = percent)

Plotting expected batting average, average launch angle, and home run probability in regard to pitch location in the strike zone better displays the effects of pitch location on hit distance than the model coefficients. Independently launch angle is highest for pitches up and center in the strike zone, expected batting average is highest in the lower heart of the strike zone, and home run probability is highest in the direct center of the strike zone. Thus, the lower bound of the ideal launch angle range begins in the lower heart of the strike zone. Thus, while the average of our ideal range is displayed to result from high pitches in the strike zone to a greater degree those pitches occurring in the middle and lower portions of the strike zone produce a much higher probability of the batted ball being a hit in general as well as a home run. Therefor displaying that while the ideal launch angle is seen more in the higher portion of the strike zone it are the pitches within the middle and lower portion of the zone that provide a better chance of getting a hit and greater hit distance as well.